Missing Value Imputation with Attribute Value Propagation using Graph-based Semi-Supervised Learning

Yukyung Shin (신유경); Hyunjung Shin (신현정)

KIISE Transactions on Computing Practices (정보과학회 컴퓨팅의 실제 논문지)

Korean Title	그래프 기반 준지도 학습을 이용한 속성값 전파 결측치 추정
English Title	Missing Value Imputation with Attribute Value Propagation using Graph-based Semi-Supervised Learning
Authors	Yukyung Shin (신유경), Hyunjung Shin (신현정)
Citation	Vol. 25, No. 10, pp. 511-516 (October 2019)
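Korean Abstract (English translation)	Records with one or more missing attribute values are commonplace. In many cases, the proportion of complete records, those with no missing values, is small relative to the size of the data. The most common remedy is statistical: replacing missing values with the mean, mode, or median. In machine learning, imputation methods based on k-nearest neighbor search or decision trees are also widely used. The former is a global approach that substitutes a representative value of each attribute, whereas the latter is a local approach that substitutes attribute values from records similar to the record in question. However, when most values of an attribute are missing, both approaches are difficult to apply. To overcome this limitation, this study proposes estimating a missing value from neighboring attributes that are highly correlated with the attribute in which the value is missing. Because the method relies on correlations between attributes, it remains usable even when most values of an attribute are missing. The proposed method builds a correlation graph from the correlation coefficients between attributes and applies graph-based semi-supervised learning to it. Missing values are estimated by propagating values from the other attributes in proportion to their correlation coefficients. Experiments compare the proposed imputation method with the statistical and machine learning methods commonly used for missing value imputation.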
English Abstract	Data records that lack one or more attribute values are very common. In many cases, few complete records, that is, records with no missing values, are available. Statistical methods that replace missing values with the mean, mode, or median are commonly used. Machine learning algorithms such as k-nearest neighbors or decision trees are also often used to estimate missing values. The statistical approach is a global method that replaces a missing value with a representative value of its attribute, whereas the machine learning approach is a local method that replaces it with attribute values taken from similar records. However, it is difficult to use either method when almost all values of an attribute are missing. To overcome this limitation, we propose a method that estimates a missing value from neighboring attributes that are highly correlated with the attribute containing the missing value. Because it is based on the correlation between attributes, it can be used even when almost all values of an attribute are missing. In the proposed method, a correlation graph is constructed from the correlation coefficients between attributes, and graph-based semi-supervised learning is applied to it. Missing values are estimated by propagation from the other attribute values in proportion to the correlation coefficients. In the experiments, the proposed method is compared with the statistical methods and machine learning algorithms that are generally used for missing value imputation.
Keywords	semi-supervised learning; graph theory; machine learning; missing value imputation
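The propagation idea described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the exact formulation, so the code assumes a standard harmonic-function form of graph-based semi-supervised learning. Within each record, attributes are treated as graph nodes, observed attributes act as labeled nodes, and missing attributes act as unlabeled nodes. The function names `pairwise_corr` and `propagate_impute`, the use of z-scores, the signed Laplacian for negative correlations, and the small ridge term are all assumptions made for this example. Every attribute is assumed to have at least a few observed values.

```python
import numpy as np


def pairwise_corr(X):
    """Pearson correlation between columns, using only rows where both are observed."""
    d = X.shape[1]
    C = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if ok.sum() > 2:
                xi, xj = X[ok, i], X[ok, j]
                if xi.std() > 0 and xj.std() > 0:
                    C[i, j] = C[j, i] = np.corrcoef(xi, xj)[0, 1]
    return C  # pairs without enough overlap keep correlation 0


def propagate_impute(X):
    """Fill NaNs by propagating observed attribute values over a correlation graph."""
    X = np.asarray(X, dtype=float)
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd[sd == 0] = 1.0
    Z = (X - mu) / sd                      # z-scores make attributes comparable

    S = pairwise_corr(X)
    np.fill_diagonal(S, 0.0)               # no self-loops in the graph
    # Signed graph Laplacian (assumption): degrees use |corr|, edges keep the sign,
    # so negatively correlated attributes propagate values with a flipped sign.
    L = np.diag(np.abs(S).sum(axis=1)) - S

    out = Z.copy()
    for r in range(Z.shape[0]):
        u = np.isnan(Z[r])                 # "unlabeled" nodes: missing attributes
        if not u.any():
            continue
        l = ~u                             # "labeled" nodes: observed attributes
        if not l.any():
            out[r, u] = 0.0                # nothing observed: fall back to column means
            continue
        L_uu = L[np.ix_(u, u)]
        rhs = S[np.ix_(u, l)] @ Z[r, l]
        # Harmonic solution L_uu f_u = S_ul f_l; the tiny ridge keeps the system
        # solvable when a missing attribute is uncorrelated with everything.
        out[r, u] = np.linalg.solve(L_uu + 1e-8 * np.eye(int(u.sum())), rhs)
    return out * sd + mu                   # map z-scores back to original units
```

Calling `propagate_impute(X)` on a 2-D array with `np.nan` marking missing entries returns an array of the same shape in which observed entries are unchanged and missing ones are filled. When a record has a single missing attribute, the solution reduces to a correlation-weighted average of the record's observed z-scores, which is one reading of "propagated in proportion to the correlation coefficient". Because correlations are estimated from whichever rows happen to be jointly observed, the sketch still produces estimates when most values of an attribute are missing, as long as a few overlapping observations exist.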